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Applicable to a Wide Domain of Sorted Arrays of Floating Point 

Numbers 
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Abstract 

Given an array Z of V + 1 strictly ordered floating point number^ and an array Z of M floating 
point numbers belonging to the interval [Zo,Za^), a common problem in numerical methods is 
to find the indices of the largest numbers in the array X which are smaller or equal than the 
numbers in the array Z. The general solution to this problem is to call M times the well known 
binary search algorithm, which has complexity 0 {log 2 N) per individual search and, in its classic 
formulation, is not vectorizable, i.e. it is not SIMD friendly. This paper describes technical 
improvements to the binary search algorithm, which make it faster and vectorizable. Next it 
proposes a new vectorizable algorithm applicable to a wide set of X partitions, which is based 
on an indexing technique and allows to solve the problem with complexity 0(1) per individual 
search at the cost of introducing an initial overhead to compute the index and requiring extra 
memory for its storage. Test results using streaming SIMD extensions compare the performance 
of the algorithm versus various benchmarks and demonstrate its effectiveness. Depending on the 
test case, the algorithm performs up to 43 times faster than the classical binary search in single 
precision and 39 times faster in double precision. Applicability limitations and cache-friendliness 
related aspects are also discussed. 
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1. Introduction 

Given an array X of strictly increasing floating point numbers and a set of floating 

point numbers |Zj| ^ belonging to the interval [Zq, Zjv), a common problem in numerical meth¬ 
ods algorithms is to find the indices of the largest numbers in the array Z which are smaller or 
equal than the numbers Zj. 

This problem arises for instance in the context of piece-wise interpolation, where a domain 
[Zo,Za?) is partitioned in sub-intervals {[Z;,Z,+i))^q* and different interpolation functions gi(x) 
are associated with each sub-interval. To compute the interpolated value for a number Zj, the 
index i of the sub-interval containing it needs to be resolved first. 


* DISCLAIMER: Although Fabio Cannizzo is employed by Standard Chartered at the time this paper is written, this 
paper has been produced by Fabio Cannizzo in a personal capacity and Standard Chartered is not associated or responsible 
for its content in any way. 

^ a floating point number is a rationale number which can be represented exactly on a computer given a chosen floating 
point representation 
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This is a special case of the more general problem of searching in a sorted array for a match¬ 
ing element. The general solution to this problem is to call M times the binary search algorithm, 
which has complexity O {M log 2 N). An history of the binary search algorithm is discussed in 
llill . where the first published mention in the literature is attributed to J. Mauchly who in 1946 
proposed a binary search algorithm to find an exact match in a sorted array of length 2^. It is 
not until 1960 that a version of the algorithm that works for any N was published by D. Lehmer 
ili. The next step was taken by Bottenbruch iH , who presented a variation of the algorithm 
that avoid a separate test for equality until the very end, thus speeding up the inner loop. 

A well known variation of Bottenbruch’s algorithm, specialised to find the index of the largest 
numbers in the array X, which is smaller or equal than some number z e [Xo,Xff), is described 
by Press et al. [Q] and is referred to in the sequel as the classical version of the algorithm. 
This requires a control flow branch, which incurs penalties on many CPU architectures and is not 
vectorizable, i.e. it does not benefit from the vectorial capabilities of modern CPUs. Furthermore 
the algorithm is not cache-friendly as explained in Khuong and Morin et al. 10]. 

In some special cases, when either the Z, or the Zj numbers exhibit particular patterns, more 
efficient algorithms are available. Examples are when the numbers Z, are equally spaced or when 
the numbers Zj are sorted, and the problem can be solved with complexity 0{M) and 0{M + N) 
respectively. However no generic alternative exists. 

Many variations of the binary search algorithm have been developed to improve binary 
search. Fibonacci search, proposed by J. Kiefer E3 and D. Ferguson foll. has the same com¬ 
plexity as binary search, but improve search time when some region of the aiTay can be accessed 
faster than others. Exponential search, proposed by Bentley et al. llldll . for a single search has 
complexity 0 (log 2 i) where i is the sought index, but, due to increased computation cost, it be¬ 
comes an improvement over binary search only if the target value lies near beginning of the 
array. Ternary search solves the problem in log^N iterations, but it is still less efficient than bi¬ 
nary search, because at every iterations it performs two comparisons instead of one, hence the 
total number of comparisons is llogjN ^ 1.26 log 2 N > log 2 N. Several other variations have 
been analyzed by Morin et al. |0] with the intent to reduce branch mis-prediction and cache 
misses. 

This paper describes technical improvements to two variations of the binary search algorithm 
available in the literature, which makes them slightly faster and vectorizable. The complexity 
of the algorithms remains O (M log 2 N), but performance improves by a proportionality factor 
aid, where cr is a constant smaller than one due to the performance improvements and d is the 
number of floating point numbers which can be processed simultaneousl}l|, assuming perfect 
vectorizatioiH. 

Next it proposes a new vectorizable algorithm based on a indexing technique, which solves 
the problem with reduced complexity 0(M), at the cost of introducing an initial overhead to 
compute the index and requiring extra memory for its storage. Although the algorithm has fairly 
wide applicability, there are particular situations where the layout of the numbers Z, might make 
its use unfeasible. These limitations are analysed and discussed and a variations of the algorithm 
which mitigate them at the cost of sacrificing some performance are introduced. A cache-friendly 
version of the algorithm is also proposed. 


depends on the chosen set of vectorial instructions and floating point representation (e.g. with SSE instructions in 
single precision d = 4) 

^perfect vectorization means that all scalar instructions used in the algorithm have a vectorial equivalent 
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2. Problem Statement 


Let: 


• X be an array of + 1 floating point numbers sorted in ascending order, i.e. Z,_i < Z,, i - 
\. .N. The array X is assumed to be stored in a container supporting access by index in 
constant time. The numbers in X exhibit no special pattern. 

• Z be an unordered array of M floating point numbers contained between the first and the 
last element of the array X, i.e. Zj e [Zo,Za^), j - 0.. .M - 1. Note that the exclusion 
of the last point X^ does not imply any loss of generality, as X^ could simply be replaced 
with the next larger floating point number. 

for each number Zj the index i of the largest element in the array X such that Z,- < Zj is sought. 
Such array of indices is denoted as |/y| . 

It is assumed that M is large and memory is not scarce, therefore it is worth investing up¬ 
front in a preliminary analysis of the structure of the array Z and in the creation of an auxiliary 
data structure in order to later achieve superior computational performance when computing the 

I iM-l 

indices . 


3. Binary Search 

3.1. Classical Binaty Search 

In absence of any special pattern of the values Z,, the classical solution to this problem is the 
binary search algorithm, which in scalar version has complexity 0(log2 N) per individual search. 
A simple scalar pseudo-code implementation is given in algorithm[T] 


Algorithm 1 Classical Binary Search (scalar implementation) 
function BiNARYSEARCH(input: z, {Z,}^, output: i) 
low «— 0 
high «— N 

while high - low > 1 do > terminating condition depends indirectly on z 

mid <— {low + high)l2 

if z < Xmid then > code branch 

high «— mid 
else 

low «— mid 

end if 
end while 
i «— low 
end function 


A close analysis of this implementation highlights the following weaknesses: the body of the 
loop contains a branch instruction and for an arbitrary z the chances for the boolean condition to 
be true are about 50%, which makes branch prediction algorithms used in modern CPUs ineffec¬ 
tive; the algorithm is not easily vectorizable because, for different z the boolean condition may 
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resolve differently causing the program flow to take different code paths and possibly requiring 
a different number of iterations for the loop to complete. 

Because proper vectorization is not possible, a vectorial implementation is obtained trivially 
iterating on all elements of the array Z in steps of one element, with complexity 0(M log 2 N). 

3.2. Binary Search Revisited 

An improved variation of binary search proposed in |[lt] is based on the observations that 
some unnecessary extra iterations in the loop do not change the result and that the maximum 
possible number of potentially required iterations is |'/og 2 ^ 1 - 

Let p be the number of bits necessary to represent the number N, which is = 1 + llog 2 Ni, 
bk be the binary value taken by the k-th bit of the sought index i, Ck — and at — bkCk, the 
sought index i has binary representation Yjk=i ‘^k- 

The bits of the index can be resolved one by one starting from the highest order one as 
follows: first, if z > Xc^ then the p-th bit of the sought index i must be set, i.e. bp - I, next, if 
z > Xa^+cp-i '^hen the (p - l)-th bit of the index i must be set, i.e. bp-\ - 1, and so on, the values 
of the remaining bits are obtained iterating the procedure. 

This yields to algorithm|2l where the input argument P is the precomputed constant P - Cb - 
\log 2 N\. The number of iterations is fixed to p - 1, so no longer dependent on z, however, as 
the algorithm proceeds it can happen that the candidate index of the vector X exceeds the size of 
the vector, therefore, it requires a double boolean condition with short boolean evaluation, which 
makes it difficult to vectorize. 


Algorithm 2 Binary Search Revisited (scalar implementation) 

function BitSetBinarySearchI (input: z, P, output: i) 

p - 

i < r - 0 


k <- P 


repeat 


r «— i 1 k 

> bitwise OR 

if r < N && z> Xr then 

> short boolean evaluation 

i r 


end if 


k ^ kl 2 

> bitwise right shift 

until k - 0 


end function 



3.3. Vectorizable Binary Search Revisited 

The first boolean condition in algorithm can be avoided with a simple trick. Noting that 
Zj < Xn for any j, if the array X is extended to the right side by padding it with the last entry 
Xbi up to a size equal to the largest possible index representable with p bits (i.e. Q - 2^ - 1), 
the condition z > Xr would resolve to false for any r > N generated by the algorithm. This 
introduces an extra memory requirement of about (2^ - N)S bytes, where S is the size in bytes 
of the elements of array X (4 for single precision and 8 for double precision). 

This yields algorithm[3j which has still complexity 0(log2N), but does not contain any code 
branch (only a conditional assignment, which can be implemented efficiently with modern CPUs 
instruction set) and is easily vectorizable. 
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Algorithm 3 Vectorizable Binary Search Revisited (scalar implementation) 

function BiTSETBiNARYSEARCH2(input: z, P, output: i) > P - 

i <— 0 
k^P 

repeat 

r «— i I A: > bitwise OR 

if z > Ar then 

i <— r > conditional assignment 

end if 

A:«— A:/2 > bitwise right shift 

until k - 0 
end function 


A vectorial implementation is obtained vectorizing the function and iterating on all elements 
of the array Z in steps of d elements, where d is the number of floating point numbers which 
can be processed simultaneouslj0. Note that, working with streaming SIMD extensions perfect 
vectorization is not achievable, because the indices contained in vector r are not contiguous, 
therefore the d elements of the vector must be fetched from memory sequentially. 

If memory is scarce, an alternative to extending and padding the array X is to take at every 
iteration the maximum between r and N as the index of the element in X used for comparison. 
This can be resolved by the compiler without branching as a conditional assignment and leads to 
algorithmic This however leads to a loss of performance, which is still superior to the classic 
binary search, but inferior to algorithm |5j which is described in the next section and does not 
require any extra memory, therefore test results are not reported in section 1531 


Algorithm 4 Vectorizable Binary Search Revisited With No Padding (scalar implementation) 

function BiTSETBiNARYSEARCH3(input: z, {Vj^o, P, output: i) 

p - 2^-^°S2Ni 

i <r- 0 


k<^ P 


repeat 


r <— ( 1 A: 

> bitwise OR 

w <— min(r, N) 

> conditional assignment 

if z > A„ then 


i r 

> conditional assignment 

end if 


k^kl2 

> bitwise right shift 

until k - 0 


end function 



3.4. Offset-Based Binary Search 

A quite effective variation of the classic binary search algorithm is proposed in iQl, where, as 
the search range is progressively bisected, its current coordinates are stored as a pair of start index 


‘*d depends on the family of streaming SIMD extensions used and on the chosen number representation, e.g. single 
or double precision. 
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and range size, as opposite to a pair of start index and end index. This allows to perform only 
one conditional assignment to update the start index, as the size of the search range decreases 
deterministically hy a factor 1 /2. Furthermore, the algorithm is already optimized to be branch- 
free. 

The algorithnyrresented here contains a technical improvement with respect to the original 
one proposed in yy, as the number of iterations is fixed and not dependent on the input data (as 
in algorithm O, thus allowing for vectorization and helping the branch predictor. Furthermore 
the problem solved in this paper is slightly different from the one solved in ||2l, which search for 
the index i of the element of the array X matching exactly the number z, which leads to more 
differences. 

The pseudo-code is presented in algorithm|5j where the search function receives in input the 
following pre-computed constants 

M - (N + l)/2 mid index 

S - N + 1 - M mid size 

J - [log 2 (A^ + 1)J number of iterations 


Algorithm 5 Offset Based Binary Search (scalar implementation) 
function BiNARYSEARCHOFFSET(input: z, ^ ’ J output: i) 

i < — 0 

if z> Xf then > Assume at least one iteration, i.e. J > 0 

i ^ F > Conditional assignment 

end if 

while / > 0 do 
7 ^ 7-1 
H S/2 
F <^i + H 

if z >= Xf then > Conditional assignment 

i <- F 

end if 
S ^ F-H 
end while 
end function 


A vectorial implementation is obtained vectorizing the function and iterating on all elements 
of the array Z in steps of d elements, where d is the number of floating point numbers which 
can be processed simultaneously. Note that, working with streaming SIMD extensions perfect 
vectorization is not achievable, because the indices contained in vector F of algorithm[5]are not 
contiguous, therefore the d elements of the vector Xf must be fetched from memory sequentially. 

3.5. Cache Friendly Binary Search 

Modern CPUs exploit a small amount of integrated high performance memory called cache, 
which can be accessed much faster than the regular RAM. In simple terms this can be thought of 
as a temporary copy of various fr^ments of the RAM. Any data needed by the program is first 
sought in cache, and, if not founcO, is uploaded from RAM to cache, then used. As the cache is 
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when this is happens, it is said that a cache-miss event occurs, which has performance penalties 
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small, every time something is loaded into it, in order to make space some other piece of data 
must be displaced. 

Because cache memory is extremely fast, it is beneficial for a program to carry out as much 
work as possible on a data set small enough to ht in cache, so that the accesses to RAM, which 
are expensive, are minimized. An algorithm designed with this property in mind is said to be 
cache friendly. 

The degree of efficiency in the use of cache memory for various formulation of comparison 
based search algorithms has been studied systematically in 12] ■ The conclusion is that one of 
the most cache friendly implementation of the search algorithms relies on reordering the array X 
according with a special layout, called Eytzinger layout lll5n . 

The algorithm incurs some setup cost to reorder the array X and requires extra storage space 
to store it. An efficient implementation requires that the size of the original array Z is 2^ - 1 for 
some value L. Algorithm [6] describes the index search procedure, assuming that the Eytzinger 
layout has already been computed and stored in a new array Y. For an explanation of how the 
Eytzinger layout is organized, the reader is referred to ||2]- 

The problem solved in this paper is slightly different from the one solved in |12], which search 
for the index i of the element of the array X matching exactly the number z, therefore the algo¬ 
rithm proposed here is also slightly different. Furthermore, the array Y is padded with the last 
element of the array X^ until it reaches the size 2^ - 1, so that the algorithm does not require the 
introduction of checks on the index range and can work at its best efficiency. This trick, already 
used in algorithm[3j does not affect results, because as part of the problem statement z < X^. 

This introduces an extra memory requirement of about (2^ - N)S bytes, where S is the size 
in bytes of the elements of array X (4 for single precision and 8 for double precision). 

The detailed search procedure is listed in algorithm|6l which uses the following precomputed 
constants; 


L = 1 + Llog2(2 + N)\ 

M - not (2L) {not is the bitwise not operator) 


Algorithm 6 Eytzinger Search (scalar implementation) 

function EYxziNGERSEARCHfinput: z, M,L, output: i) 


P ^ 1 


if z > To then 


P^2 

> conditional assignment 

end if 


while L > 1 do 


1 


if z > T/j then 


G^2 

> conditional assignment 

end if 


P ^2P+Q 


L^L-\ 


end while 


i <- P&M 

> bitwise AND 

end function 
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A vectorial implementation is obtained vectorizing the function and iterating on all elements 
of the array Z in steps of d elements, where d is the number of floating point numbers which 
can be processed simultaneously. Note that, working with streaming SIMD extensions perfect 
vectorization is not achievable, because the indices contained in vector P of algorithm 0 are not 
contiguous, therefore the d elements of the vector Xp must be fetched from memory sequentially. 


4. Direct Search 


A scalar algorithm with complexity 0(1) per individual search can be obtained via construc¬ 
tion of an auxiliary function which maps real numbers z e [Aq, Aa?) directly to the sought indices 
i according with the following simple procedure. 

Let / be a function which maps floating point numbers z e [Aq, Ajv] to natural numbers in 
[0, R] for some value R and satisfies the following properties 


f(Xo) = 0 
/(Xn) = R 

'ia,b e [Ao,Aa^], a> b => /(a) > f{b) 
/(A,-+i) > /(A,), 


(la) 

(lb) 

(l c) 

i = 0...iV-l (Id) 


Let be a sorted array of natural numbers mapping the indices j generated by function / 

to indices of the array A as follows 


j 0, = 0 

\i, /(A,-i)<y</(A,) 

Possible pseudo-code to construct the array K as specified in (|2]l is proposed in algorithm lO. 
The definition lO of array K, implies the following property 


c = /(A,) ^ 


(3) 


Given a floating point number z e [A,, A,+i), property ( flcl l of function / guarantees that 


a = /(A,) < y = f{z) < /(A,+i) = b. 


and property (fTdl l implies that a < b. Because of (l3]l = i and Kj, - / -H 1, hence the index 

t - Kj can be either i or / H- 1 


i — Ka < t — Kj < Kh — i + I 

Since z e [A,-, A,+i), if r = i then z > A,, while if t = i -t 1 then z < A,, therefore the sought index 
can be trivially resolved comparing z with A,: 

■ _ i t - i if z < Af, 

1 t otherwise 


Algorithm 7 Initialization of array K (pseudo-code) 

function lNiTK(input: /(z), output: 

b^R ' 

i^N 

while > 0 do 

t <— f(Xi) 

while b > t do > always false at the first iteration, when i - N 

Kb <— j 
b b — I 

end while 

if = t then > always true at the first iteration, when i - N, thus j gets initialized 

j <- i 

Kb <— j 
b b - \ 

end if 

end while 
end function 


4.1. Proposed Choice for the Function f 

The search procedure described in section|4]relies on the existence of a function / satisfying 
properties O- Since the largest possible value generated by the function is R, which defines fhe 
size of the array |Aj| in order to minimize storage space requirements and initialization cost, 
it is desirable for R to be as small as possible. Furthermore, since function / is used in the search 
routine, it is desirable for it to be computationally fast. 

A possible choice, not necessarily optimal, is to use the simple formula 

(4) 

where H is an appropriately chosen constant. This is clearly a computationally efficient function, 
as its evaluation requires only a multiplication, a subtraction and a truncation. 

Condition (fTdl) requires that H satisfies the inequalities 

lH{Xi^i-Xo)i>lH(Xi-Xo)i, / = 0...A-1 (5) 

and the truncation operation can be removed writing the more restrictive system of inequalities 
H (Xi+i - Xo) > I + H {Xi - Xo), i^0...N-l (6) 

which yields the theoretical lower bound 


min; {A;+i - Xij^g^ 

Once a value for N satisfying the lower bound Q is chosen, R is simply determined applying 
formula (|4]l to X/y 


R=L^(Xb/-Xo)] 
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A possible value for the constant H which strictly satisfy inequality (|7]l could be obtained as 


Xn-Xq 

min; {Xi - X;_i )^j 


Xn-Xq 

In reality, as explained in section |4^ because of rounding errors this is not a robust approach to 
determine H and a different methodology must be used. 

Note that, because of the transformations performed of inequalities (O into more restrictive 
conditions, the lower bound H proposed in (O is not guaranteed to be optimal, i.e. it is not 
guaranteed to be the smallest possible value which would make function ([4| compatible with 
properties ([T]l. For instance, given the array X - {0,0.5,0.7,1.1}, equations (O yields H a; 6.36, 
however smaller values would also be admissible, e.g. H - 3. The determination of the optimal 
value of H satisfying properties O is not addressed in this paper. 

4.2. Algorithm 

Summarizing, using the choice for function / proposed in (|4|, for a given number z e 
[Xq, Xn) the index i such that z e [Xi, X;+i) can be obtained with the following procedure, given 
in pseudo-code in algorithm[8j 

1. compute the index j using (|4ll 

2. read the correspondent index i stored in Kj 

3. verify if the numbers z is to the right or to the left of Xi, and, if it is to the left, decrement 
the index i by one. 

Note that the last step does not involve any conditional jump. In the scalar case it can be resolved 
via conditional assignment. In the vectorial case with streaming SIMD extensions a floating point 
comparison operations returns a bit mask, which, if reinterpreted as a signed integer, is either 0 
or -1 and can be trivially added to i. 


(9a) 

(9b) 


Algorithm 8 Direct Search (scalar implementation) 

function DiRECTSEARCH(input: z, H output: i) 


j^[H(z-Xo)i 


i <- Kj 


if z < Xi then 



> conditional assignment 

end if 


end function 



Similarly to what already discussed for algorithm [3j working with Intel SIMD instructions 
perfect vectorization is not achievable, as fetching from memory the relevant elements of Kj and 
Xi must be done sequentially. 
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4.3. Geometric Interpretation 

The algorithm descrihed in section (14.21) has an intuitive equivalent geometric interpreta- 
tion. Let = 2^0 + jlH\ ._q be a fictitious array of equally spaced real numbers, the segments 

{[Fj, Fy+i)} have constant length IjH and overlap with the segments 

The function j - f{z) defined in (|4ll can be interpreted as the mapping from a real number 
z e [Zo,2(v) to the interval [Yj,Yj+i) which contains it. 

The array j defined in (|2]l can be interpreted as the mapping of the abstract segments 

|[T;, F/+i)| to elements of the array X as described below and illustrated in figure[T] 


Kj - max |i : X,- < Y^ 




1=1 


Given a real number z e [Xq, Xjv), the operations j - f(z) and t - Kj, identify respectively 
the segment {Yj, Yj+\) containing z and an associated element X,. Because of (|7]l, it is guaranteed 
that the segment [Yj, T/+i) is smaller than the smallest segment min,- therefore it 

can contain at most one single element of the array X. This means that either ^ < Yku 

implying that z could be either to the left or to the right of Xt, or X, < Yj, implying that z is 
always to the right of X,. In both cases, the sought index i is either t, if z < Xt, or f - 1 otherwise. 





Figure 1: Mapping of segments j[yj, Fj+i)} ._g to the points of the array X 


4.4. Dealing with Floating Point Rounding Errors 

Unfortunately the use of floating point arithmetic introduces a number of flaws in the proce¬ 
dure for the determination of H and R proposed in section |4~T] Some examples were computa¬ 
tions fails are illustrated below. 

• Given the partition X - {-10®, 0,1} in single precision, X2 — Xq gets rounded to 10® and it 
is effectively indistinguishable from Xi - Xq. Given function (IH, regardless on the choice 
of H, it is impossible for property (fTdll to hold. 

• Given the array X = {0,1.42 10 1) in single precision, equation ll9all overflows and 

yeilds R = -i-oo. 

Before continuing the discussion, it is useful to introduce some well known facts and nota¬ 
tions about floating point numbers (for a more comprehensive discussion on this topic see j^). 
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4.4.1. Facts and Notations about Floating Point Numbers 

Floating point numbers are a finite subset of rationale numbers and in a floating point numer¬ 
ical system any real numbers is approximated with its closest floating point number. 

The result of an algebraic operations might not be a floating point numbers even if its 
operands are floating point numbers and it is therefore affected hy a rounding error. 

Amongst real positive numbers, only those hounded in a certain interval can be approximated 
with floating point numbers. Numbers too small or too big respectively underflow to zero or 
overflow to the abstract concept of -too. 

Let; 

• mil) be the floating point approximation of a real number z 

• 4>ix) be the smallest floating point number greater than the floating point number x 

• e be the round-off error associated with a given floating point representation, which is 
e - 2“^^^ in single precision and e - 2“^^ in double precision 

The relative rounding error made when a real number x is approximated by the floating point 
number m(x) is bounded by 


1 


mix) 
e < —^ 

X 


< 1 + e 


( 10 ) 


4.4.2. Feasibility Conditions 

The algebraic operations involved in the computation of ([4]) and (|9]) are in general affected 
by rounding errors, therefore it is not guaranteed that the value of H they produce satisfies nu¬ 
merically condition ([5]). Taking into account rounding errors, conditions (O become 

lm(Hm(XM-Xo))i>lmiHmiXi-Xo))i, i^0...N-l (11) 

A necessary condition for (fTTT l to hold is that the results of all arithmetic subtractions involved 
must be numerically distinguishable, i.e. strictly increasing: 

m(A,+i - Xo) > m{Xi - Aq), / = 0... A - 1 (12) 

Using (ITOl l and taking repeatedly lower bound of the left hand side and a upper bound of the right 
hand side, conditions become 


(A,+i -Ao)(l 

-e) 

> 

(.Xi- 

Ao)(l + e), 

■ 

-A; 

> 

(Xm 

+ A;-2Ao)e, 

min{A,+i - A,- 

i 

xN-l 

U=0 

> 

(2Xn 

-2Ao)e 

min; {A;+i - A, ) 

N-l 


2e 



> 



Xn-Xo 


f = 0...A - 1 
f = 0...A - 1 


(13) 


Assuming conditions (IT2ll hold, conditions (fTTTl can be rewritten as a function of the rounded 
interval lengths {D; = m(Xi - Xo)}f_Q 


lm{HDi+i)i>[m(HDi)i, i^0...N-l (14) 
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the truncation operation can be resolved writing a more restrictive set of inequalities, 




( 15 ) 


and the theoretical lower bound O can be approximated numerically as 


H> H = m 


min,' {m(Di+i - AOljlo* 


(16) 


The size of the array K is R - In order to avoid numerical overflows, it must be 

R < 2^, where Q is the number of bits used for the index returned by function lO. For the 
function ([4]) to be efficiently vectorizable, it is desirable that Q - 32or Q e {32,64) depending if 
the array X is in single precision or in double precision. Note that Q is not necessarily the same 
as the number of bits used to represent elements of the array K, which is discussed in section l43] 
This requires that 


m{HDN) <2^ (17) 

Ignoring rounding errors, which might force H to be slightly larger than the theoretical lower 
bound ©, this limitation can be approximately expressed in terms of the layout of the original 
array X as 


Xn-Xq 

min; {Z;+i - Xi}fSQ 


< 2 ^ 


(18) 


and the approximate feasibility conditions ifTSll and (fTSll can be combined in one single expres- 
sion 


min; (Z ;+1 - 
Xn-Xo 


> max |2 2e| 


2“^^, in single precision, with Q - 32 
2“^^, in double precision, with Q - 32 
2“^', in double precision, with Q - 64 


(19) 


which approximately defines the family of arrays X where the method is applicable. 

Note that conditions dTSl) and (fTSl) are purely theoretical and cannot be verified exactly, be¬ 
cause of rounding errors. It is however easy and inexpensive to verify directly the original con¬ 
ditions dT^ and, given a value of H, ifTTll . 


4.4.3. Computation of a Feasible H 

Both the value for H computed in (IT^ and the evaluation of function (|4ll are affected by 
rounding error, therefore there is no guarantee that choosing H - H conditions llT4l) are satisfied. 
Should they not be, a larger value of H is needed. 

As discussed in section 1431 it is desirable for H to be as small as possible. That poses the 
difficult question of how much larger should H be? There is no obvious answer and in the sequel 
a feasible value of FI is computed numerically, by increasing FI in small amounts in a trial and 
error iterative process. In brief, after a value for H is chosen, the following two steps happen 
in a loop: condition ifTSl) is tested for all i and, if it does not hold, H is progressively increased 
according with some growth strategy. 

A possible growth strategy for H is described in algorithm |9l which increments H adding 
terms of exponentially increasing size, until a feasible value is found. 
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Algorithm 9 Computation of H and R (pseudo-code) 
function CoMPUTEHR(input: {A/)^q output: H, R) 

// <— 0 {h) > Initialize H strictly larger than the approximate lower bound (fTbll 

Dn ~ ^0 

if Dn > 2*J then > Check for overflow verifying condition (fTTIl 

ERROR; overflow, problem unfeasible 

end if 

P ^ (p (H) - H > Define a growth term P 

for i = 1... A do 
Di-i <— A,_i - Ao 
Di ^ Xi - Xo 

if D,_i = Di then > Verify that the sequence D, is strictly increasing 

ERROR: D, are not strictly increasing, problem unfeasible 

end if 

while \Ji Di^i\ - \ HDi\ do > Check if condition llT4l) is satisfied 

H <— H + P > Increase H 

if Dff > 2*J then > Check for overflow verifying condition (fTTI l 

ERROR: overflow, problem unfeasible 

end if 

P 2P > Double the growth term P 

end while 
end for 

R «— \_HXn\ > Compute R 

end function 
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A minor modification of algorithm 0 would easily allow to also track the last unfeasible 
value for H, thus defining an interval [Hunfeasible, Hfeasible] which is known to be unfeasible at its 
left extreme and feasible at its right extreme. Then instead of simply taking H - Hfeasible, the 
solution could be refined further by bisecting this interval. This has not been implemented in this 
paper, as in all scenarios tested in section ISAl H is extremely rarely increased, and, when it is, 
the increment is negligible in relative terms. 

4.5. Memory Cost 

Elements of the array K must have a size B in bytes sufficiently large to store the largest index 
of the array X, i.e. N. To allow efficient memory manipulation, it should be Z? e {1,2,4, 8), i.e. 

B = min{fje{l,2,4,8}:2^*>A} 

Algorithm[8]requires the allocation of R B bytes. The size R of the array K, apart from numerical 
related considerations, is approximately (l9al) . therefore, the total memory cost in bytes can be 
estimated with a quite high degree of accuracy as 


MemoryCost ^ 


Xn-Xq 

min,- {Z,+i - A, }^j 


B 


4.6. Initial Setup Cost 

The initial setup is divided in two parts, the computation of H performed in algorithm (|9]l and 
the initialization of the array K carried out in algorithm O. 

The latter has computational complexity 0{N + aR), as it requires evaluation of function (|4ll 
for all elements of the array X and an assignment for all elements of arrays K. The constant 
of proportionality a reflects the fact that the two operations do not have the same cost and it is 

Of 1. 

ft is more difficult to estimate precisely the computational complexity associated with the 
computation of H, as it is depends on the number of iterations of the inner loop in algorithm (O, 
which increases H if needed. In theory it is of order 0(N (I + T)), where T is the average number 
of iterations in the inner loop. In practice, experimental results in section (15.4b show that T a; 0, 
so the overall complexity is approximately 0{N). 

The last component of the setup cost is the memory allocation of the array K, which is de¬ 
pendent on the programming language, the operating system and the memory allocation strategy 
used. 

In section (15.4b some test results for the total setup cost, inclusive of memory allocation using 
the default gcc heap allocator, are reported. 

4. 7. Mitigating Limitations and Reducing Memory Consumption 

It is possible to reduce the memory consumed by the index and mitigate limitations by relax¬ 
ing the requirements of properties dMll as follows 

f(Xi^2)>fiXi), i^0...N-2 (20) 

The definition of the index K needs to be generalized as: 


Kj = max{f : /(A,) < ;} 
15 


( 21 ) 






From definition (l2Tll . it follows the properties 


K, 

Kc 


- i 


(22a) 

(22b) 


a = f{Xi) < f(XM) = b 
a = f{Xi) = /(X,+i) = b 


^Kb^i+\ 


Note that the modified definition of the array K does not require any change to algorithm lO. 


Given a real number z e [Z,, Z,+i), the sought index i can be resolved as 


i = Kj - I(z < X,) - I(z < Z,_i) 


(23) 


where j - f{z) and /(w) is the indicator function 


I(w) = 


1, if w is true 
0, otherwise 


The proof of this statement is more complicated than in the case discusses in section|4l Property 
(fTB of function / guarantees that 


a = f{Xi) < ;■ = fiz) < /(Z,+i) = b. 


which implies 

Ka<t^Kj< Kb, (24) 

There are three possible scenarios: 

1. if f(Xi) < /(Z/+i) < f(Xi+ 2 ), from property (I22all it follows that Ka - i, Kb - i + I, and 
(l24ll implies f e {i,i + 1} 

2. if f{Xj) < f{Xi+i) < /(Zi+ 2 ), from properties (l22ll it follows that Ka - i, Kb — i + 2, and 
(l24ll implies t e {i, i + 1, / + 2) 

3. if f(Xj) < f(Xi+\) < f(Xi+ 2 ), from properties (l22]l it follows that Ka - i + 1, Kb - i + 2, 
and (l24ll implies t e {i + l,i + 2] 

In all cases, t can only have one of the values in the set {/, / + 1, / + 2): 

• if t = i, equation (|23]| yields: i - I(z < Z,) - I(z < Z,_i) - i - 0 - 0 - i 

• if t = / + 1, equation ll2^ yields: i + I - I{z < Z,+i) - I(z < Z,) = i + 1 - 1 - 0 = i 

• if t = i + 2, equation (|23]) yields: i + 2 - I(z < Z,+ 2 ) - I(z < Z,+i) = / + 1 - 1 - 1 = i 

which completes the proof. 

Ignoring rounding errors, condition (|6ll becomes 


//(Z,4.2-Zo) > 1+//(Z,-Zo), (25) 

yielding a smaller lower bound on H 

H > 


min; {Z ;+2 - Xi}f^Q^ 
16 


(26) 



which results in a smaller value of R and therefore requires a smaller storage space for the auxil¬ 
iary array K. Note that algorithm|3needs to he modified accordingly. 


Limitations are also mitigated as conditions ( fT9b become 


min; {Xi +2 - 
Xn-Xo 


> max |2 2e| 


2“^^, in single precision, with Q - 32 
2~^^, in double precision, with Q - 32 
2“^', in double precision, with Q - 64 


(27) 


which are clearly less restrictive. 

Summarizing, given the function / and the array the search procedure is described in 

algorithm[TOl The algorithm is very similar to the one previously described, with the difference 


Algorithm 10 Direct Search Gap2 (scalar implementation) 

function DiRECTSEARCHGAp2(input: z, H, output: i) 

j^lH(z-Xo)i 
i <- Kj 

if z < Z, then 

/- 1 > conditional assignment 

end if 

if z < then 

> conditional assignment 

end if 

end function 


that the number z needs to be compared with both Z, and Z,_i, with consequent performance 
degradation, and the index i could be accordingly decreased by 1 or 2. The second conditional 
assignment could be made contingent on the first condition z < Z,, but in this case the first con¬ 
dition would no longer just trigger a conditional assignment and become a genuine code flow 
branch, hence it is preferable to check the two conditions independently. The downside is that, if 
t - 0, this causes access to the array element Z_i, therefore the array Z need to be padded to the 
left with one extra element containing the value Zq. This does not affect the correctness of the 
algorithm because the condition z < Zq is always false. 

Note that this same idea could be pushed even further, generalizing condition (fTdl) to 

/(Z;+,) > /(Z;), i^0...N-q (28) 

which requires comparisons of z with Z,, Z;_i,..., Zf_^. 

This would allow to construct an optimal solution deriving the minimal number of q which 
achieves a given memory budget. 

4.8. Cache Efficient Implementation 

Noting that in algorithm [TT] the following two steps always occur in close sequence: 

1. the index t - Kj is read from the array K 

17 






2. the element X, is read from the array X 

suggests to reorganize the data in memory so that Kj and X, are stored contiguously in memory 
and part of the same cache lineH. In this way, whenever Kj is fetched, Z, is loaded in cache at the 
same time and it is already available when it is needed for the execution of the next instruction, 
thus reducing the number of cache misses. 

This is easily achieved modifying the data type stored in the array K to be the pair of values 
{Kj,Xt), appropriately padded so that their joint memory storage requirement is either 8 or 16 
bytes to guarantee good cache line alignment. For instance working in single precision with 
32-bit indices, the pair requires exactly 8 bytes, while working in double precision with 32-bit 
indices, the pair requires 12 bytes and an extra 4 bytes padding needs to be appropriately inserted. 
Obviously this algorithm has higher memory requirements than algorithm [8j as elements of the 
array X are stored multiple times, plus there may be some wasted space for padding. 

The pseudo-code is illustrated in algorithm [TT] 


Algorithm 11 Direct Search Cache Friendly (scalar implementation) 

function DiRECTSEARCHWiTHCACHE(input: z, {X,}j^Q, H output: i) 

j^lH(z-Xo)i 

(/, x) <— Kj > Kj contains the pair of values (/, X,) 

if z < X then 

i <— / - 1 > conditional assignment 

end if 

end function 


5. Test Results 

5.1. Data Types Used 

All tests results presented in this section are performed for arrays X in single and double 
precision. For simplicity of implementation, the indices used in algorithm [8j [10] and [TT] are 
always 32-bits unsigned integers, regardless of the size of array K. Elements of the array K are 
always 32-bits unsigned integers, regardless of the size of the array X. 

5.2. Array X Layout 

All tests results presented in this section are based on arrays X of size 2^-1, for P - 
{4, 8,12,16,20). The choice of this particular set of array sizes, near to powers of 2, is simply 
aimed at optimizing memory usage for algorithms [3] and 0 and has no impact on performance 
measurements for algorithmsISlfTOlandfTTl 

Regardless of the size, all arrays X are composed of sub intervals of random length (X,+i -X,) 
sampled for each i from a uniform distribution in [1,5]. 

The motivation associated with the choice of this particular layout is merely to assure that 
the feasibility condition ( iT9] ) for algorithm[8]is satisfied in probabilistic terms even for the largest 
size of the array X tested. In fact, the expected length of an interval E[X,+i - X,] is 3 and, when 


typical cache line is 64 bytes long and starts at memory addresses which are multiple of 64 
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the size of the array X is N ^ condition Cl holds with good margin both in single and 
double precision 


min, {X,- ^ 1 _ 1 _ 1 =15 2“^' > 2“^^ 

Xn-Xo - X^-Zo ~ E[Z^-Zo] 3-220 

Furthermore, since the elements of the array Z are uniformly distributed in [Zo,Zjv), the 
choice of a layout X with intervals of comparable length, assures that there is no major dispro¬ 
portion in the number of times each element of the array X is picked, thus avoiding test results to 
be affected by cache memory related effects. 

Alternative layouts could be tested, and, with respect to algorithms O[T0] and [TT] they might 
lead to partitions either feasible or unfeasible, but this would make no difference for the per¬ 
formance measurements carried out, except perhaps for considerations associated with cache 
memory. 

5.3. Performance Tests 

5.3.1. Performance Test Details 

Tables [T][5] compare the throughput of various algorithms for arrays X of increasing size, as 
described in section [521 Results obtained usiM the Intel Math Kernel Library (MKL), which 
uses a proprietary algorithm, are also includedQ 

For most algorithms there are 3 possible implementations for each floating point representa¬ 
tion, where the set of instructions used is either scalar or vectorial with vectors of 128 bits (Intel 
SSE instruction set) or 256 bits (Intel AVX instruction set). Exceptions are algorithms [1] where 
the vectorial implementation consists in simply calling in a loop the scalar implementation with 
function Mining enabled, and MKL, where the scalar implementation consists in calling the vec¬ 
torial implementation with vectors of size 1 and there is no way to force use of only SSE or AVX 
instructions. In both cases the same identical algorithm is used to produce the numbers reported 
in the SSE and in the AVX section. 

For each floating point representation and set of instructions the results are expressed as the 
number of indices resolved per unit of time. The setup-cost, for those algorithms requiring it, is 
excluded from the reported results. 

The array Z is stored aligned on a 32 bytes boundary, for efficient vectorial memory access. 
Its elements have no particular pattern and are randomly uniformly distributed in [Xq, Xn). 

5.3.2. Performance Test Observations 

MKL. MKL performance is generally superior to classic binary search, but significantly inferior 
to any of the enhanced versions of binary search discussed. The scalar case in particular, for 
small arrays, is even slower than classic binary search. 

Enhanced Versions of Binary Search. The three enhanced versions of binary search (algorithms 
[3j|5]and|^ for arrays of small to medium size exhibit comparable performance in the vectorial 
case and only minor differences in the scalar case, with algorithm[3]being faster. For large arrays, 
as expected, the cache friendly algorithm[6]beat the others. 


^The version of MKL used is 2017. The functions used are dfsSearchCellslD in single precision and dfdSearch- 
CellslD in double precision. The partition X is initialized with the QUASLUNIFORM hint, which yields experimental 
results superior to the alternatives for the particular layouts of the aiTay X used in this test 
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Direct Search. All versions of direct search (algorithms [8l ITOl fTTI) yield a throughput increase 
of at least one order of magnitude with respect to binary search algorithms. This is no surprise 
given the difference in computational complexity. As expected, algorithm[TO]has slightly inferior 
performance than algorithm[8] due to the increased computation cost. AlgorithmlTTIperformance 
is superior to all others for large vectors, where an efficient utilization of cache memory becomes 
critical. 

General Observations. All algorithms show a significant speed-up switching from the scalar to 
the SSE vectorial implementation, while switching from SSE to AVX leads to less consistent 
improvements, which vary with the algorithm, the precision and the array size. Note that the 
throughput increase factor achievable via vectorization is expected to be smaller than d. This is 
because, using Intel SIMD instructions, as explained in previous sections, perfect vectorization 
is not achievable. 

5.4. Direct Search Setup Cost Test 

5.4.1. Direct Search Setup Cost Test Details 

As discussed in section lAhl the setup cost is uncertain because of the unpredictable number of 
times H is increased in the inner loop of|9l Therefore all test results in this section are described 
with statistical properties over populations of randomly generated arrays X. 

Table [6] presents the number of times H is increased in the inner loop of algorithm |9] A 
sample population of 10000 randomly generated arrays X is used. 

Table |7] presents the setup cost in nanoseconds normalized by the size of the array X. A 
sample population of 1000 randomly generated arrays X is used. 

Multiplying the test results in table |7] for the array size yields the average setup time in 
nanoseconds. This allow for a direct comparison with the results presented in tables [T]l5j which 
are expressed in million of searches per seconds, making it possible to express the setup costs in 
terms of equivalent number of searches. 

5.4.2. Direct Search Setup Cost Test Observations 

Table [6] shows that the number of times H is increased in the inner loop of algorithm |9] is 
independent on N. At most one iteration is carried out, regardless of the size of the array X. 
This implies that the initial guess (O is acceptable in almost all cases and, when an increase 
is necessary, that is negligible in relative terms. Hence the approximate limiting conditions (fT9l l 
which is derived from dl have good general accuracy. 

Table|7]shows that the setup cost normalized by the size of the array X, initially decreases as 
the size of the array X grows, but then it increases until it stabilize on an asymptotic level. 

The initial decrease can be explained as the effect of the fixed overhead associated with the 
algorithm, due for instance to the allocation of memory for the array K, which is material for a 
small array, but gets amortized over larger arrays. 

The subsequent increase can be explained by the larger amount of total memory necessary to 
store the arrays X and K, which results in progressively inefficient use of the cache memory. 

From these experimental results it is possible to conclude that, a part from memory I/O 
effects, the setup cost per element is independent on N, i.e. the setup has complexity 0{N). 
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5.5. Reproducibility 

All test results presented in section|5]have been produced using the C++ source code publicly 
available from githutH, compiled for Linux 64-bits platform with gcc 6.3.1 on a machine with 
AVX-2 capabilitjB 

6. Conclusion 

Possible improvements to the classic binary search algorithm have been described. Although 
having the same asymptotic complexity, these are vectorizable and generally faster. 

Next a new algorithm with superior asymptotic complexity has been presented. Test results 
using streaming SIMD extensions demonstrate that with arrays X of various length randomly 
populated the proposed algorithm is up to 43 times faster than the classical binary search in 
single precision and up to 39 times faster in double precision. 

A cache friendly version of the algorithm is also proposed, which in most cases leads to even 
superior performance. 

There are situations where the algorithm is not applicable. These are extensively discussed 
and can be inexpensively identified in the preliminary analysis phase, thus allowing fail-back 
to some of the enhanced versions of the binary search algorithm. Possible variations of the 
algorithm, which mitigate such limitations sacrificing some performance or reduce cache misses 
at the cost of requiring more memory, have also been proposed. 


References 

7. References 

[1] 2011, Pulver, Binary Search Revisited, URL: http://eigenjoy.eom/2011/09/09/binary-search-revisited/ 

[2] 2015, R Khuong, P. Morin, AiTay Layout for Comparison Based Searching, ARXIV 

[3] 1997, D. Knuth, Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, second 
edition. 

[4] 2007, W. Press, S. Teukolsky, W. Vetterling, B.Flannery, Numerical Recipes, The Art of Scientific Computing, 
Sections 3.1, Third Edition, Cambridge University Press 

[5] 1991, D. Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic, Computing 
Surveys, Association for Computing Machinery 

[6] 1982, L. Johnson, R. Riess, Numerical Analysis, Second Edition, Addison Wesley 

[7] 2008-754 IEEE Standard for Floating-Point Arithmetic, The Institute of Electrical and Electronics Engineers, 2008 

[8] 2012, P. Khuong, Binary search is a pathological case for caches. Some Lisp, URL: 
http://www.pvk.ca/Blog/2012/07/30/binary-search-is-a-pathological-case-for-caches/ 

[9] 1946, J. Mauchly, Theory And Techniques for Design of Electronic Digital Computers, Lectures given at the Moore 
School of Electrical Engineering 

[10] 1960, D. H. Lehmer, Proceedings of Symposia in Applied Mathematics 10 

[11] 1962, H. Bottenbruch, JACM 9 

[12] 1953, J. Kiefer, Sequential minimax search for a maximum, Proc. American Mathematical Society 4, 502506. 

[13] 1960, D. Ferguson, Fibonaccian searching. Communications of the ACM, vol. 3, is. 12, p. 648 

[14] 1976, J. Bentley, A. Yao, An almost optimal algorithm for unbounded searching, Infonnation Processing Letters. 5 
(3): 8287 

[15] 1590, M. Eytzinger, Thesaurus principum hac aetate in Europa viventium (Cologne) Aitsingero, Aitsingerum, 
Eyzingem. 


^https://github.com/fabiocannizzo/fastbinarysearch.git 
^CPU model: Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz 

21 




Single Precision 

Double Precision 

Scalar 

d=l 

SSE-4 

<1=4 

AVX-2 

d=8 

Scalar 

d=l 

SSE-4 

d=2 

AVX-2 

d=4 

Algorithm [I] 

48.30 

48.96 

51.28 

47.67 

50.86 

49.47 

Algorithmic 

177.18 

375.26 

396.64 

197.96 

263.19 

214.13 

Algorithmic 

180.45 

369.56 

396.06 

182.14 

255.83 

213.23 

Algorithmic 

161.30 

374.44 

391.60 

160.98 

254.03 

207.07 

MKL 2017 

34.55 

167.08 

167.55 

31.70 

152.89 

151.93 

Algorithm m 

331.11 

726.62 

728.05 

330.27 

492.58 

553.76 

Algorithm nol 

207.78 

576.57 

607.32 

207.08 

410.46 

453.70 

Algorithm HI] 

328.13 

732.89 

754.83 

312.39 

570.22 

469.42 


Table 1: Throughput in million of searches per second with vector X of size 15 



Single Precision 

Double Precision 

Scalar 

d=l 

SSE-4 

d=4 

AVX-2 

d=8 

Scalar 

d=l 

SSE-4 

d=2 

AVX-2 

d=4 

Algorithm U 

20.55 

20.40 

20.50 

20.45 

19.94 

20.46 

Algorithmic 

80.49 

122.18 

148.41 

81.29 

89.98 

81.36 

Algorithmic 

73.47 

122.87 

148.01 

74.05 

90.36 

81.15 

Algorithmic 

65.63 

122.45 

146.56 

65.25 

88.94 

79.97 

MKL 2017 

24.95 

83.44 

83.50 

22.43 

84.18 

84.21 

Algorithmic 

331.26 

722.33 

728.57 

330.16 

491.94 

554.18 

Algorithm UOl 

207.68 

575.63 

605.53 

206.71 

409.92 

451.44 

Algorithm nil 

328.17 

730.32 

733.19 

312.16 

569.70 

470.62 


Table 2: Throughput in million of searches per second with vector X of size 255 



Single Precision 

Double Precision 

Scalar 

d=l 

SSE-4 

d=4 

AVX-2 

d=8 

Scalar 

d=l 

SSE-4 

d=2 

AVX-2 

d=4 

Algorithm|I| 

14.08 

14.15 

14.16 

13.95 

13.87 

14.10 

Algorithmic 

45.30 

70.75 

91.33 

44.45 

47.07 

48.48 

Algorithmic 

42.98 

70.96 

91.14 

41.82 

47.39 

48.42 

Algorithmic 

38.51 

70.34 

90.00 

37.95 

46.59 

48.24 

MKL 2017 

21.57 

53.52 

53.54 

19.23 

51.65 

51.65 

Algorithmic 

309.78 

642.61 

618.90 

302.05 

427.90 

487.01 

Algorithm nol 

195.31 

544.69 

548.09 

186.98 

352.28 

396.78 

Algorithm [IT] 

302.19 

705.73 

639.64 

259.23 

476.40 

387.07 


Table 3: Throughput in million of searches per second with vector X of size 4095 
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Single Precision 

Double Precision 

Scalar 

d=l 

SSE-4 

d=4 

AVX-2 

d=8 

Scalar 

d=l 

SSE-4 

d=2 

AVX-2 

d=4 

Algorithm [I] 

9.05 

9.11 

9.10 

7.13 

7.13 

7.20 

Algorithmic 

20.89 

28.86 

38.04 

12.81 

11.92 

15.47 

Algorithmic 

18.45 

26.39 

35.04 

12.28 

11.84 

15.44 

Algorithmic 

22.31 

44.63 

50.08 

17.55 

25.73 

22.52 

MKL 2017 

15.39 

26.21 

26.21 

13.12 

22.89 

22.89 

Algorithmic 

159.67 

345.02 

349.01 

148.88 

246.81 

281.38 

Algorithm UOl 

118.57 

315.81 

331.39 

108.98 

199.31 

241.49 

Algorithm HI] 

239.46 

547.16 

511.86 

218.39 

394.57 

333.06 


Table 4: Throughput in million of searches per second with vector X of size 65535 



Single Precision 

Double Precision 

Scalar 

d=l 

SSE-4 

d=4 

AVX-2 

d=8 

Scalar 

d=l 

SSE-4 

d=2 

AVX-2 

d=4 

Algorithm|I| 

4.59 

4.62 

4.62 

3.70 

3.70 

3.73 

Algorithmic 

5.49 

10.68 

17.11 

4.84 

5.23 

7.15 

Algorithmic 

6.17 

11.14 

17.44 

5.04 

5.40 

7.33 

Algorithmic 

9.44 

18.79 

23.76 

7.94 

9.98 

10.95 

MKL 2017 

7.83 

9.99 

9.98 

6.78 

8.74 

8.74 

Algorithmic 

62.46 

61.68 

59.37 

53.02 

52.68 

50.69 

Algorithm HOl 

67.08 

75.00 

72.22 

57.75 

60.51 

58.73 

Algorithm |llj 

105.96 

96.31 

92.64 

82.17 

82.43 

76.13 


Table 5: Throughput in million of searches per second with vector X of size 1048575 



Single Precision 

Double Precision 

array size 

mean 

min 

max 

stdev 

mean 

min 

max 

stdev 

15 

0.0061 

0.0000 

1.0000 

0.0779 

0.0048 

0.0000 

1.0000 

0.0691 

255 

0.0000 

0.0000 

0.0000 

0.0000 

0.0001 

0.0000 

1.0000 

0.0100 

4095 

0.0000 

0.0000 

0.0000 

0.0000 

0.0000 

0.0000 

0.0000 

0.0000 

65535 

0.0000 

0.0000 

0.0000 

0.0000 

0.0000 

0.0000 

0.0000 

0.0000 

1048575 

0.0000 

0.0000 

0.0000 

0.0000 

0.0000 

0.0000 

0.0000 

0.0000 


Table 6: Number of H updates 



Single Precision 

Double Precision 

array size 

mean 

min 

max 

stdev 

mean 

min 

max 

stdev 

15 

10.24 

8.96 

11.59 

0.47 

11.81 

10.11 

13.31 

0.49 

255 

8.27 

8.00 

8.66 

0.12 

8.40 

8.06 

8.65 

0.13 

4095 

13.80 

13.54 

13.98 

0.09 

13.51 

13.30 

13.70 

0.07 

65535 

17.63 

17.61 

17.65 

0.01 

17.49 

17.46 

17.51 

0.01 

1048575 

17.67 

17.66 

17.69 

0.01 

17.61 

17.58 

17.66 

0.02 


Table 7: Statistical setup cost for algorithm[^in nano seconds normalized by the array size 
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